library(ggplot2)
library(ggcorrplot)
library(corrplot)
library(lessR)
library(car)
library(olsrr)
data <- read.csv(file="ames_housing_data.csv", header=TRUE, sep=",")
data_cont <- data[,c("LotFrontage","LotArea","MasVnrArea","BsmtFinSF1","BsmtFinSF2","TotalBsmtSF", "FirstFlrSF","SecondFlrSF","LowQualFinSF","GrLivArea","GarageArea","WoodDeckSF", "OpenPorchSF","EnclosedPorch", "ThreeSsnPorch", "ScreenPorch","PoolArea","MiscVal","SalePrice")]
Initial views indicate that GrLivArea (above-grade living area) may be the “best” continuous variable to start with. Its histogram shows a fairly “normal” distribution (with some outliers), its correlation with sale price is the highest at 0.71, and its QQ plot suggests it is one of the more normally distributed continuous variables.
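The screening steps above (histogram, QQ plot, correlation with price) can be sketched as below. This is a hedged, self-contained sketch on simulated stand-in data, since the Ames CSV is not bundled here; with the real data, replace `sim_df` with `data_cont`.

```r
# Hedged sketch of the screening steps on simulated stand-in data;
# with the real data, swap sim_df for data_cont.
set.seed(42)
sim_df <- data.frame(GrLivArea = rnorm(500, mean = 1500, sd = 400))
sim_df$SalePrice <- 15000 + 110 * sim_df$GrLivArea + rnorm(500, sd = 50000)

hist(sim_df$GrLivArea, main = "GrLivArea", xlab = "square feet")   # shape check
qqnorm(sim_df$GrLivArea); qqline(sim_df$GrLivArea)                 # normality check
cor(sim_df$GrLivArea, sim_df$SalePrice)                            # correlation with price
```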
model1 <-lm(SalePrice~GrLivArea, data=data_cont)
summary(model1)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = data_cont)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483467 -30219 -1966 22728 334323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13289.634 3269.703 4.064 0.0000494
## GrLivArea 111.694 2.066 54.061 < 0.0000000000000002
##
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
## F-statistic: 2923 on 1 and 2928 DF, p-value: < 0.00000000000000022
Make a scatterplot of Y and X, and overlay the regression line on the cloud of data.
ggplot(data_cont, aes(x=GrLivArea, y=SalePrice)) +
geom_point(col = "purple") +
geom_smooth(method='lm', fill = "red") +
theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs(title = "Housing Price",x="Above Ground living area", y="Sale Price")
The equation for the linear regression using above-grade living area is \(Y = 13289.634 + 111.694*X_{1}\). The intercept, 13289.634, says that if the above-grade living area is 0, the predicted housing price is $13,289.63. This may or may not make sense: a house with 0 square feet is not logical, but it could be read as the price of the lot (the land) alone.
\(B_{1}\) is 111.694, which indicates that for each additional square foot, the predicted price increases by about $111.70.
The R-squared is 49.95%, which says that this variable explains about half the variance in price. Adjusted R-squared is nearly identical, as expected since there are no other predictors in the model.
\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)
This is to say that the null hypothesis is that above-grade square footage does not affect sale price.
The alternative hypothesis is that the slope is nonzero.
The omnibus hypothesis is the same since we are only using a single variable.
Based on the ANOVA output, the F-statistic is significant, so we reject the null hypothesis that the coefficient on above-grade square footage is 0.
anova(model1)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## GrLivArea 1 9337629924310 9337629924310 2922.6 < 0.00000000000000022
## Residuals 2928 9354907186041 3194981962
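An equivalent, hedged way to read the t-test on the slope is through a confidence interval: rejecting \(H_{0}: B_{1} = 0\) corresponds to the 95% CI for the slope excluding 0. A minimal sketch on simulated stand-in data (for the real model it would just be `confint(model1)`):

```r
# Hedged sketch: with the real model this is just confint(model1).
set.seed(1)
x <- rnorm(200, mean = 1500, sd = 400)
y <- 13000 + 110 * x + rnorm(200, sd = 40000)
fit <- lm(y ~ x)
ci <- confint(fit)["x", ]   # 95% CI for the slope
ci                          # excludes 0, consistent with rejecting H0: B1 = 0
```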
#a - predicted values of model
model1_predicted_values<- model1$fitted.values
data_cont$predicted_values <- model1_predicted_values
#b - calculating the residuals
data_cont$residuals <-data_cont$SalePrice - data_cont$predicted_values
#c - standard deviation of the residuals
sd_residuals <- sd(data_cont$residuals)
#standardizing the residuals
data_cont$standardized_residuals <- data_cont$residuals/sd_residuals
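As a side note, the manual scaling above divides every residual by one overall standard deviation. Base R's `rstandard()` instead studentizes each residual by its own leverage-adjusted standard error; the two agree closely when no point has high leverage. A hedged sketch on simulated data:

```r
# Compare overall-SD scaling (used above) with leverage-adjusted rstandard().
set.seed(7)
x <- rnorm(300, mean = 1500, sd = 400)
y <- 13000 + 110 * x + rnorm(300, sd = 40000)
fit <- lm(y ~ x)
manual       <- resid(fit) / sd(resid(fit))   # the approach used above
leverage_adj <- rstandard(fit)                # leverage-adjusted version
max(abs(manual - leverage_adj))               # small when leverage is modest
```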
The standardized residuals look roughly normal in the histogram, centering around zero, but we do see observations at 5 or more standard deviations.
ggplot(data_cont, aes(x=standardized_residuals)) + geom_histogram(fill = "purple") + theme_classic() + labs(title = "Standardized Residuals", x="standard deviations")
As the predicted values get higher, the residuals start to “fan out”. For the lower home prices we seem to be predicting well, but as the price gets higher and higher, the difference between y and y-hat grows larger and larger.
From an outliers and influential points perspective, there are a few observations over $500,000. These are potentially influential points driving the results.
ggplot(data_cont, aes(x=predicted_values, y=standardized_residuals)) +
geom_point(col = "purple") +
theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1)) + labs(title = "Standardized residuals", y="standardized residuals", x = "predicted Sale Price")
# 2
model2_data <- data[,c("OverallQual","SalePrice")]
model2 <-lm(SalePrice ~ OverallQual, data=model2_data)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual, data = model2_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -197507 -29254 -2283 21658 397493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -95003.6 3933.8 -24.15 <0.0000000000000002
## OverallQual 45251.0 628.8 71.96 <0.0000000000000002
##
## Residual standard error: 48020 on 2928 degrees of freedom
## Multiple R-squared: 0.6388, Adjusted R-squared: 0.6387
## F-statistic: 5179 on 1 and 2928 DF, p-value: < 0.00000000000000022
ggplot(data, aes(x=OverallQual, y=SalePrice)) + geom_point(col = "purple") + geom_smooth(method='lm', fill = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))
The model equation for this is \(Y = -95003.6 + 45251.0*X_{1}\).
Similar to the prior example, the intercept may not be informative here, but \(B_{1}\) says that each one-point increase in the overall quality rating raises the predicted sale price by $45,251.
The interpretation for overall quality should be a bit different. Since the predictor is ordinal rather than continuous, a constant-slope interpretation may not be accurate. In model1, saying that each unit change in square footage moves the sale price by a fixed dollar amount makes sense and is consistent across the whole range.
For overall quality, however, each one-point rating increase may not equate to the same dollar change in price, because the jump from a rating of 1 to 2 may differ from the jump from 7 to 8. So even if moving from rating 1 to rating 2 adds $45,251, the same may not hold from rating 9 to rating 10.
The R-squared in this example says that this variable explains 64% of the variance. Again, that may be accurate, but the impact of each specific rating change is still unknown.
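One hedged way to relax the equal-step assumption discussed above is to treat the rating as a factor, which estimates a separate mean for each level instead of a single common slope. A sketch on simulated ratings (the variable names here are stand-ins, not the Ames columns):

```r
# Simulated ratings with deliberately unequal dollar steps between levels.
set.seed(3)
qual  <- sample(1:10, 500, replace = TRUE)
price <- 50000 + 30000 * qual + 4000 * qual^2 + rnorm(500, sd = 30000)
linear_fit <- lm(price ~ qual)           # assumes equal dollar jump per rating
factor_fit <- lm(price ~ factor(qual))   # one mean per rating, no equal-step assumption
summary(factor_fit)$r.squared >= summary(linear_fit)$r.squared   # factor fit never worse
```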
\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)
This is to say that the null hypothesis is that overall quality does not affect sale price.
The alternative hypothesis is that the slope is nonzero.
The omnibus hypothesis is the same since we are only using a single variable.
Based on the ANOVA output, the F-statistic is significant, so we reject the null hypothesis that the coefficient on overall quality is 0.
anova(model2)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## OverallQual 1 11941155651186 11941155651186 5178.7 < 0.00000000000000022
## Residuals 2928 6751381459165 2305799679
For this (and future exercises), we will rely on lessR or other packages for the visuals.
Looking at the residuals vs. fitted plot, we see a similar “fanning out” pattern, indicating heteroscedasticity, with potential outliers in the upper tail of the residuals and high-leverage points among the more expensive houses.
reg2<-reg(SalePrice~OverallQual,data=model2_data)
reg2
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(SalePrice ~ OverallQual, data=model2_data, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: model2_data
##
## Response Variable: SalePrice
## Predictor Variable: OverallQual
##
## Number of cases (rows) of data: 2930
## Number of cases retained for analysis: 2930
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) -95003.551 3933.822 -24.150 0.000 -102716.890 -87290.213
## OverallQual 45251.028 628.805 71.964 0.000 44018.083 46483.973
##
##
## Standard deviation of residuals: 48018.743 for 2928 degrees of freedom
##
## R-squared: 0.639 Adjusted R-squared: 0.639 PRESS R-squared: 0.638
##
## Null hypothesis that all population slope coefficients are 0:
## F-statistic: 5178.748 df: 1 and 2928 p-value: 0.000
##
##
## df Sum Sq Mean Sq F-value p-value
## Model 1 11941155651186.020 11941155651186.020 5178.748 0.000
## Residuals 2928 6751381459165.417 2305799678.677
## SalePrice 2929 18692537110351.438 6381883615.688
##
##
## K-FOLD CROSS-VALIDATION
##
## RELATIONS AMONG THE VARIABLES
##
## SalePrice OverallQual
## SalePrice 1.00 0.80
## OverallQual 0.80 1.00
##
##
## RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20, out of 2930 rows of data, or do res_rows="all"]
## ------------------------------------------------------------------------
## OverallQual SalePrice fitted resid rstdnt dffits cooks
## 1768 10 755000 357506.731 397493.269 8.388 0.457 0.102
## 1761 10 745000 357506.731 387493.269 8.172 0.445 0.097
## 2446 10 625000 357506.731 267493.269 5.608 0.305 0.046
## 1064 10 615000 357506.731 257493.269 5.396 0.294 0.043
## 433 10 610000 357506.731 252493.269 5.290 0.288 0.041
## 45 9 611657 312255.702 299401.298 6.282 0.266 0.035
## 1638 9 591587 312255.702 279331.298 5.855 0.248 0.030
## 2451 9 584500 312255.702 272244.298 5.705 0.241 0.029
## 434 9 582933 312255.702 270677.298 5.672 0.240 0.029
## 1499 10 160000 357506.731 -197506.731 -4.130 -0.225 0.025
## 424 10 555000 357506.731 197493.269 4.130 0.225 0.025
## 457 10 552000 357506.731 194493.269 4.067 0.221 0.024
## 2333 9 556581 312255.702 244325.298 5.115 0.216 0.023
## 2331 10 545224 357506.731 187717.269 3.925 0.214 0.023
## 2335 10 535000 357506.731 177493.269 3.710 0.202 0.020
## 2181 10 183850 357506.731 -173656.731 -3.629 -0.198 0.019
## 2182 10 184750 357506.731 -172756.731 -3.610 -0.197 0.019
## 2904 1 81500 -49752.523 131252.523 2.743 0.190 0.018
## 16 8 538000 267004.674 270995.326 5.676 0.176 0.015
## 367 9 501837 312255.702 189581.298 3.962 0.168 0.014
##
##
## FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals do pred_rows="all"]
## ------------------------------------------------------------------------------------
## OverallQual SalePrice pred sf pi:lwr pi:upr width
## 766 1 61000 -49752.523 48133.671 -144131.798 44626.752 188758.550
## 1554 1 13100 -49752.523 48133.671 -144131.798 44626.752 188758.550
## ...
## 2929 5 170000 131251.590 48031.871 37071.921 225431.258 188359.337
## 1 6 215000 176502.618 48026.974 82332.552 270672.684 188340.132
## 3 6 172000 176502.618 48026.974 82332.552 270672.684 188340.132
## ...
## 2383 10 465000 357506.731 48089.671 263213.730 451799.731 188586.001
## 2446 10 625000 357506.731 48089.671 263213.730 451799.731 188586.001
## 2667 10 475000 357506.731 48089.671 263213.730 451799.731 188586.001
##
##
## -----------------------------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: Reg Line, Confidence and Prediction Intervals
## -----------------------------------------------------
Overall, OverallQual has a higher R-squared, but I’m a bit hesitant about the result because it is not a continuous variable. Overall quality may explain more variance, but when it comes to quantifying impact (inference), it may not be as useful. If this were purely a prediction exercise, I would use OverallQual; if it were an inferential exercise to understand the impact of specific variables, I would rely on living-area square footage.
model3_data <-data[,c("OverallQual","GrLivArea", "SalePrice")]
model3 <- lm(SalePrice~., data=model3_data)
summary(model3)
##
## Call:
## lm(formula = SalePrice ~ ., data = model3_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -393985 -23302 -1227 19949 283509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -109919.156 3420.959 -32.13 <0.0000000000000002
## OverallQual 33241.400 659.595 50.40 <0.0000000000000002
## GrLivArea 58.754 1.841 31.91 <0.0000000000000002
##
## Residual standard error: 41370 on 2927 degrees of freedom
## Multiple R-squared: 0.732, Adjusted R-squared: 0.7319
## F-statistic: 3998 on 2 and 2927 DF, p-value: < 0.00000000000000022
\(Y = -109919.156 + 33241.400*X1 + 58.754*X2\)
The interpretation is similar; we would ignore the intercept.
For every one-point increase in overall quality (holding GrLivArea constant), sale price goes up $33,241; on the flip side, holding OverallQual constant, every additional square foot of living area increases sale price by $58.75. Here we speak to each variable’s individual impact while accounting for the other. Interestingly, once overall quality enters the model, the impact of square footage is much smaller, though the earlier caveat about interpreting OverallQual still applies.
The R-squared of 73.2% indicates a better fit; both R-squared and adjusted R-squared beat model 1 (R-squared 49.95%), an improvement of 0.2325.
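The shrinkage of the GrLivArea coefficient noted above is the textbook behavior when predictors are correlated: the simple-regression slope absorbs some of the shared signal, and the multiple-regression slope gives the effect holding the other predictor fixed. A hedged sketch with simulated stand-ins:

```r
# Simulated stand-ins: "area" correlated with "qual", true area effect is 2.
set.seed(11)
qual  <- rnorm(1000)
area  <- 0.6 * qual + rnorm(1000, sd = 0.8)
price <- 5 * qual + 2 * area + rnorm(1000)
coef(lm(price ~ area))["area"]          # inflated: absorbs quality's effect
coef(lm(price ~ area + qual))["area"]   # close to the true 2
```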
\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)
\(H_{0} : B_{2} = 0\)
\(H_{1} : B_{2} \neq 0\)
\(H_{0}: B_{1}=B_{2}=0\): none of the variables has a relationship to sale price.
\(H_{1}: B_{i} \neq 0\) for at least one \(i\): at least one variable is related to SalePrice. Both variables are significant, so we reject the null in all cases.
anova((model3))
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## OverallQual 1 11941155651186 11941155651186 6978.2 < 0.00000000000000022
## GrLivArea 1 1742658594840 1742658594840 1018.4 < 0.00000000000000022
## Residuals 2927 5008722864326 1711213825
reg3<-reg(SalePrice~OverallQual+GrLivArea,data=model3_data)
Inspecting the plots, we still see issues with heteroscedasticity (although an improvement over the prior plot). As is, the variables break the assumptions of linear regression, but we can continue working toward a possible transformation. We still see some outliers among the more expensive homes, so they will need to be addressed.
model4_data <-data[,c("SalePrice", "OverallQual","GrLivArea", "WoodDeckSF")]
model4 <- lm(SalePrice~., data=model4_data)
summary(model4)
##
## Call:
## lm(formula = SalePrice ~ ., data = model4_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -388293 -23729 -1327 18918 293383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -106703.125 3389.802 -31.478 <0.0000000000000002
## OverallQual 32374.109 656.834 49.288 <0.0000000000000002
## GrLivArea 56.519 1.831 30.873 <0.0000000000000002
## WoodDeckSF 57.838 6.221 9.297 <0.0000000000000002
##
## Residual standard error: 40780 on 2926 degrees of freedom
## Multiple R-squared: 0.7397, Adjusted R-squared: 0.7395
## F-statistic: 2772 on 3 and 2926 DF, p-value: < 0.00000000000000022
The equation for this is \(Y = -106703 + 32374X_{1} + 56.519X_{2} + 57.838X_{3}\).
Similar to the prior examples, all variables are significant, and overall quality has the largest impact on sale price. The new WoodDeckSF variable has a per-square-foot effect similar to GrLivArea’s. It is possible that square footage in general does not have the greatest per-unit impact on sale price.
Surprisingly, the R-squared barely improved, rising only to 73.97%, less than one percentage point of additional explained variance. Given that the two square-footage coefficients are similar, it may be that square footage has a weak per-unit impact on sale price and is correlated with another predictor.
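The suspected correlation between predictors can be checked with the variance inflation factor, computable by hand as \(1/(1 - R_{j}^{2})\) from regressing one predictor on the others. A hedged sketch with simulated stand-ins for the predictors:

```r
# Simulated stand-ins; grliv is mildly correlated with the other predictors.
set.seed(5)
grliv <- rnorm(500)
deck  <- 0.3 * grliv + rnorm(500)
qual  <- 0.5 * grliv + rnorm(500)
r2_j <- summary(lm(grliv ~ deck + qual))$r.squared   # grliv explained by the others
vif_grliv <- 1 / (1 - r2_j)   # rule of thumb: VIF above ~5 flags collinearity
vif_grliv
```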
\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)
\(H_{0} : B_{2} = 0\)
\(H_{1} : B_{2} \neq 0\)
\(H_{0} : B_{3} = 0\)
\(H_{1} : B_{3} \neq 0\)
\(H_{0}: B_{1}=B_{2}=B_{3}=0\): none of the variables has a relationship to sale price.
\(H_{1}: B_{i} \neq 0\) for at least one \(i\): at least one variable is related to SalePrice. All three variables are significant, so we reject the null in all cases.
anova(model4)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## OverallQual 1 11941155651186 11941155651186 7181.842 < 0.00000000000000022
## GrLivArea 1 1742658594840 1742658594840 1048.098 < 0.00000000000000022
## WoodDeckSF 1 143700943544 143700943544 86.427 < 0.00000000000000022
## Residuals 2926 4865021920782 1662686918
Using the new variable WoodDeckSF, we actually don’t see much improvement in the residuals vs. predicted plot. So while we reject the null hypothesis, based on the residuals vs. fitted plot we might not keep WoodDeckSF if we had a second attempt at this.
reg4<-reg(SalePrice~OverallQual + GrLivArea + WoodDeckSF, data = model4_data)
Plots of the log models below
LNmodel1 <- data_cont[,c("SalePrice","GrLivArea")]
LNmodel1$LogSalePrice <- log(LNmodel1$SalePrice)
LNreg1 <- reg(LogSalePrice~GrLivArea, data=LNmodel1)
LNmodel3 <- model3_data[,c("SalePrice","GrLivArea","OverallQual")]
LNmodel3$LogSalePrice <- log(LNmodel3$SalePrice)
LNreg3 <- reg(LogSalePrice~GrLivArea+OverallQual, data=LNmodel3)
LNmodel4 <- model4_data[,c("SalePrice","GrLivArea","OverallQual","WoodDeckSF")]
LNmodel4$LogSalePrice <- log(LNmodel4$SalePrice)
LNreg4 <- reg(LogSalePrice~GrLivArea+OverallQual+WoodDeckSF, data=LNmodel4)
### 6
Comparing across models (see below), we see the most improvement in Model4 after transformation. Prior to transformation, Model4 explains 74% of the variance; after transformation, 77%, an improvement of 3 percentage points. This suggests that the log transform does help. The root mean squared errors cannot be compared between raw and log models since the log models are in different units, but within each group they illustrate how the error decreases as we add more variables.
df <- data.frame("Model"=c("Model1","Model3","Model4", "Model1_log", "Model3_log", "Model4_log"),
"R-Squared"=c(.49,.73,.74,.48,.75,.77), "Adj R-Squared"=c(.49,.73,.74,.48,.76,.77), "RMSE"=c(237.7,203,201.9,.53,.44,.44))
df
## Model R.Squared Adj.R.Squared RMSE
## 1 Model1 0.49 0.49 237.70
## 2 Model3 0.73 0.73 203.00
## 3 Model4 0.74 0.74 201.90
## 4 Model1_log 0.48 0.48 0.53
## 5 Model3_log 0.75 0.76 0.44
## 6 Model4_log 0.77 0.77 0.44
The difference in interpretation with a transformed response variable is that the units are on a logarithmic scale, so predictions need to be transformed back to the original scale once model adjustments are complete. Depending on the context, if we are talking about prediction (the final result), the scale shouldn’t matter, but if we are talking about explanation, we would need to undo the log to speak to a non-technical audience.
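For a log-response model there is also a middle ground before fully undoing the log: exp(coefficient) is the multiplicative effect on the price, so \((e^{b} - 1) \times 100\) is the approximate percent change per unit of X. A hedged sketch, using for illustration an OverallQual coefficient of 0.1678 (the value from the filtered log model fitted later in this analysis):

```r
# Back-transform sketch; b_qual is the OverallQual estimate (0.1678) taken
# from the filtered log model later in this analysis, used here for illustration.
b_qual <- 0.1678
pct_per_rating <- (exp(b_qual) - 1) * 100
round(pct_per_rating, 1)   # each additional rating point ~ +18.3% predicted price
```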
LNmodel4 <- model4_data[,c("SalePrice","GrLivArea","OverallQual", "WoodDeckSF")]
LNmodel4$LogSalePrice <-log(LNmodel4$SalePrice)
LN_reg4<-lm(LogSalePrice~GrLivArea+OverallQual+WoodDeckSF,data=LNmodel4)
t1<-ols_plot_dffits(LN_reg4)
n=2930
p=3
threshold = 2*sqrt((p+1)/(n-p-1))
threshold
## [1] 0.07394739
dffitsData <- t1$data
LNmodel4$Dffits_indicator <-dffitsData$dbetas
updatedData <- LNmodel4[which(LNmodel4$Dffits_indicator <= threshold & LNmodel4$Dffits_indicator >= -threshold),]
str(updatedData)
## 'data.frame': 2792 obs. of 6 variables:
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
## $ GrLivArea : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ OverallQual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ WoodDeckSF : int 210 140 393 0 212 360 0 0 237 140 ...
## $ LogSalePrice : num 12.3 11.6 12.1 12.4 12.2 ...
## $ Dffits_indicator: num 0.026475 -0.019653 0.000818 0.016757 0.045278 ...
With the updated data, R-squared improves to 82.8%, a gain of about 6 percentage points over the log model’s 77%. This comes after removing the influential points, about 138 observations, or roughly 5% of the data.
Removing the influential points appears to be an acceptable action given the improvement and given that they are only about 5% of the total sample. Keeping in mind that we are still trying to predict a ‘typical’ home price, the remaining 95% of the data still tells the story.
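The same DFFITS filter can be sketched with base R’s `dffits()`, avoiding the olsrr dependency; the threshold rule \(2\sqrt{(p+1)/(n-p-1)}\) is the same one used above. Simulated stand-in data:

```r
# DFFITS filter with base R on simulated data; same threshold rule as above.
set.seed(9)
n <- 500; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, 2, 3) + rnorm(n))
fit <- lm(y ~ X)
d <- dffits(fit)                              # base R, no olsrr needed
threshold <- 2 * sqrt((p + 1) / (n - p - 1))  # rule-of-thumb cutoff
keep <- abs(d) <= threshold
sum(!keep)                                    # rows flagged as influential
```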
UpdatdLN_reg4<-lm(LogSalePrice~GrLivArea+OverallQual+WoodDeckSF,data=updatedData)
summary(UpdatdLN_reg4)
##
## Call:
## lm(formula = LogSalePrice ~ GrLivArea + OverallQual + WoodDeckSF,
## data = updatedData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55709 -0.10070 0.00867 0.10878 0.53464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.544122065 0.013742501 767.26 <0.0000000000000002
## GrLivArea 0.000290128 0.000007699 37.68 <0.0000000000000002
## OverallQual 0.167840651 0.002685298 62.50 <0.0000000000000002
## WoodDeckSF 0.000336095 0.000025997 12.93 <0.0000000000000002
##
## Residual standard error: 0.1561 on 2788 degrees of freedom
## Multiple R-squared: 0.8283, Adjusted R-squared: 0.8281
## F-statistic: 4484 on 3 and 2788 DF, p-value: < 0.00000000000000022
Based on my plots I will consider:
FirstFlrSF, GrLivArea, WoodDeckSF, LotArea, SecondFlrSF, and OpenPorchSF, as those visually and numerically show correlation to sale price. Because the correlations are low overall, most of the variables I chose are still “low”.
Of these, I’m keeping GrLivArea, WoodDeckSF, LotArea, and OpenPorchSF. These have relatively low correlation coefficients with each other. The selection also makes intuitive sense, as I’m choosing characteristics of different portions of the home (living area, deck, land, and porch).
corr <-cor(data_cont)
ggcorrplot(corr,
type ="lower",
lab = TRUE,
lab_size=3,
method = "circle",
colors = c("red", "white", "purple"),
title = "Correlogram of Housing prices",
ggtheme = theme_classic)
model5_data <- data[,c("SalePrice","GrLivArea","WoodDeckSF","LotArea","OpenPorchSF", "OverallQual")]
model5_data$LogSalePrice <- log(model5_data$SalePrice)
The model explains 77.4% of the variance (\(R^{2} = 0.774\)).
model5_raw <-lm(LogSalePrice~GrLivArea+WoodDeckSF+LotArea+OpenPorchSF+OverallQual,data=model5_data)
summary(model5_raw)
##
## Call:
## lm(formula = LogSalePrice ~ GrLivArea + WoodDeckSF + LotArea +
## OpenPorchSF + OverallQual, data = model5_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02276 -0.09395 0.01784 0.12139 0.62136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.4937466507 0.0166185521 631.448 <0.0000000000000002
## GrLivArea 0.0002261516 0.0000092421 24.470 <0.0000000000000002
## WoodDeckSF 0.0002791945 0.0000298244 9.361 <0.0000000000000002
## LotArea 0.0000050654 0.0000004788 10.579 <0.0000000000000002
## OpenPorchSF 0.0001431600 0.0000571420 2.505 0.0123
## OverallQual 0.1810833272 0.0031714448 57.098 <0.0000000000000002
##
## Residual standard error: 0.1939 on 2924 degrees of freedom
## Multiple R-squared: 0.7742, Adjusted R-squared: 0.7738
## F-statistic: 2005 on 5 and 2924 DF, p-value: < 0.00000000000000022
anova(model5_raw)
## Analysis of Variance Table
##
## Response: LogSalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## GrLivArea 1 235.617 235.617 6269.898 < 0.00000000000000022
## WoodDeckSF 1 13.163 13.163 350.288 < 0.00000000000000022
## LotArea 1 0.930 0.930 24.755 0.0000006889
## OpenPorchSF 1 4.480 4.480 119.203 < 0.00000000000000022
## OverallQual 1 122.515 122.515 3260.188 < 0.00000000000000022
## Residuals 2924 109.881 0.038
Scaling the predictors did not improve the \(R^2\) (still 77.4%), as expected, since standardizing is a linear rescaling. Because of this, I think using the original units makes more sense from an interpretability standpoint.
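A quick check of the claim above: standardizing predictors is a linear rescaling, so \(R^2\) cannot change (only the coefficients change units). Sketch on simulated stand-ins:

```r
# R-squared is invariant to standardizing the predictors.
set.seed(2)
x1 <- rnorm(400, mean = 1500, sd = 400)
x2 <- rnorm(400, mean = 100, sd = 60)
y  <- 12 + 0.0002 * x1 + 0.0003 * x2 + rnorm(400, sd = 0.2)
raw_r2    <- summary(lm(y ~ x1 + x2))$r.squared
scaled_r2 <- summary(lm(y ~ scale(x1) + scale(x2)))$r.squared
all.equal(raw_r2, scaled_r2)   # TRUE: scaling leaves R-squared unchanged
```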
scaled_data_x <- model5_data[,c("GrLivArea","WoodDeckSF","LotArea","OpenPorchSF", "OverallQual")]
y <- model5_data[,"LogSalePrice"]
model5_data_scaled <- data.frame(lapply(scaled_data_x, function(x) scale(x,center=TRUE, scale=TRUE)))
final_data<-cbind(y,model5_data_scaled)
final_reg <- reg(y~GrLivArea+WoodDeckSF+LotArea+OpenPorchSF+OverallQual,data=final_data)
final_reg
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(y ~ GrLivArea + WoodDeckSF + LotArea + OpenPorchSF + OverallQual, data=final_data, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: final_data
##
## Response Variable: y
## Predictor Variable 1: GrLivArea
## Predictor Variable 2: WoodDeckSF
## Predictor Variable 3: LotArea
## Predictor Variable 4: OpenPorchSF
## Predictor Variable 5: OverallQual
##
## Number of cases (rows) of data: 2930
## Number of cases retained for analysis: 2930
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 12.0209687 0.0035813 3356.605 0.000 12.0139466 12.0279908
## GrLivArea 0.1143217 0.0046720 24.470 0.000 0.1051610 0.1234823
## WoodDeckSF 0.0352795 0.0037687 9.361 0.000 0.0278900 0.0426689
## LotArea 0.0399156 0.0037731 10.579 0.000 0.0325173 0.0473139
## OpenPorchSF 0.0096609 0.0038561 2.505 0.012 0.0020999 0.0172219
## OverallQual 0.2555133 0.0044750 57.098 0.000 0.2467388 0.2642878
##
##
## Standard deviation of residuals: 0.1938532 for 2924 degrees of freedom
##
## R-squared: 0.774 Adjusted R-squared: 0.774 PRESS R-squared: 0.771
##
## Null hypothesis that all population slope coefficients are 0:
## F-statistic: 2004.866 df: 5 and 2924 p-value: 0.000
##
##
## df Sum Sq Mean Sq F-value p-value
## GrLivArea 1 235.6169427 235.6169427 6269.8976109 0.000
## WoodDeckSF 1 13.1634836 13.1634836 350.2876029 0.000
## LotArea 1 0.9302501 0.9302501 24.7544721 0.000
## OpenPorchSF 1 4.4795301 4.4795301 119.2027827 0.000
## OverallQual 1 122.5148388 122.5148388 3260.1878549 0.000
## Model 5 376.7050453 75.3410091 2004.8660647 0.000
##
## Residuals 2924 109.8812107 0.0375791
##
## y 2929 486.5862561 0.1661271
##
##
## K-FOLD CROSS-VALIDATION
##
## RELATIONS AMONG THE VARIABLES
##
## y GrLivArea WoodDeckSF LotArea OpenPorchSF OverallQual
## y 1.00 0.70 0.33 0.26 0.32 0.83
## GrLivArea 0.70 1.00 0.25 0.29 0.34 0.57
## WoodDeckSF 0.33 0.25 1.00 0.16 0.04 0.26
## LotArea 0.26 0.29 0.16 1.00 0.10 0.10
## OpenPorchSF 0.32 0.34 0.04 0.10 1.00 0.30
## OverallQual 0.83 0.57 0.26 0.10 0.30 1.00
##
##
## Tolerance VIF
## GrLivArea 0.588 1.701
## WoodDeckSF 0.903 1.107
## LotArea 0.901 1.110
## OpenPorchSF 0.863 1.159
## OverallQual 0.641 1.561
##
##
## GrLivArea WoodDeckSF LotArea OpenPorchSF OverallQual R2adj X's
## 1 1 1 1 1 0.774 5
## 1 1 1 0 1 0.773 4
## 1 0 1 1 1 0.767 4
## 1 0 1 0 1 0.767 3
## 1 1 0 1 1 0.765 4
## 1 1 0 0 1 0.765 3
## 1 0 0 1 1 0.757 3
## 1 0 0 0 1 0.756 2
## 0 1 1 1 1 0.728 4
## 0 1 1 0 1 0.723 3
## 0 0 1 1 1 0.716 3
## 0 0 1 0 1 0.712 2
## 0 1 0 1 1 0.704 3
## 0 1 0 0 1 0.697 2
## 0 0 0 1 1 0.687 2
## 0 0 0 0 1 0.682 1
## 1 1 1 1 0 0.522 4
## 1 1 0 1 0 0.520 3
## 1 1 1 0 0 0.513 3
## 1 1 0 0 0 0.511 2
## 1 0 1 1 0 0.495 3
## 1 0 0 1 0 0.492 2
## 1 0 1 0 0 0.487 2
## 1 0 0 0 0 0.484 1
## 0 1 1 1 0 0.235 3
## 0 1 0 1 0 0.205 2
## 0 1 1 0 0 0.153 2
## 0 0 1 1 0 0.152 2
## 0 1 0 0 0 0.111 1
## 0 0 0 1 0 0.102 1
## 0 0 1 0 0 0.065 1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
##
## RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20, out of 2930 rows of data, or do res_rows="all"]
## ---------------------------------------------------------------------------------------------------------------------------------
## GrLivArea WoodDeckSF LotArea OpenPorchSF OverallQual y fitted resid rstdnt dffits cooks
## 1499 8.1943358 0.9516195 6.8196646 3.6226176 2.7675742 11.9829291 14.0056921 -2.0227630 -10.8192794 -1.9969220 0.6392400
## 957 1.0609300 -0.7419335 26.0274893 -0.7043724 0.6414619 12.8346813 13.3120807 -0.4773994 -2.8568330 -1.6717325 0.4646400
## 2181 7.1122579 3.5790007 3.6982249 6.4677617 2.7675742 12.1218755 13.8775724 -1.7556969 -9.3208180 -1.5852155 0.4068700
## 2182 6.2833901 0.9041367 3.8002552 5.3119219 2.7675742 12.1267588 13.6813532 -1.5545944 -8.1911686 -1.1649043 0.2211700
## 1571 0.5663789 4.5603118 19.6080876 -0.7043724 -0.7759464 12.3412589 12.8242023 -0.4829433 -2.6953089 -1.1047342 0.2029700
## 727 -1.5423872 -0.7419335 -0.2879336 7.0456816 -1.4846505 10.4602421 11.4956922 -1.0354501 -5.4343615 -0.8610147 0.1223600
## 2116 1.2745761 -0.7419335 18.8898151 0.7478365 -0.0672422 12.5317728 12.8845470 -0.3527742 -1.9539431 -0.7611590 0.0964700
## 1554 -1.5166706 -0.7419335 0.5629528 -0.7043724 -3.6107628 9.4803675 10.9144732 -1.4341057 -7.4865501 -0.5392086 0.0475600
## 182 -1.3208283 -0.7419335 -0.0624265 -0.7043724 -2.9020587 9.4563407 11.0929831 -1.6366424 -8.5604587 -0.4911544 0.0392400
## 2072 0.6415507 2.2653101 13.3249799 0.0069136 0.6414619 12.6181823 12.8700740 -0.2518917 -1.3444756 -0.3564682 0.0211700
## 1611 0.6652891 -0.7419335 5.8949205 -0.7043724 -0.7759464 11.5424843 12.1010804 -0.5585961 -2.9054323 -0.3446331 0.0197500
## 1183 2.8571398 0.3659983 1.8304626 0.3329197 2.0588701 11.9183906 12.9628629 -1.0444723 -5.4248814 -0.3427887 0.0194000
## 1403 0.3230597 1.0307575 5.4668758 1.3257564 -1.4846505 12.4529327 11.9459396 0.5069931 2.6350974 0.3016304 0.0151300
## 2044 -0.0310389 -0.6311403 0.0218373 2.4223224 -1.4846505 10.8197783 11.6400796 -0.8203013 -4.2529152 -0.2772820 0.0127400
## 1321 2.5228232 0.0890154 0.4253897 6.7344940 2.7675742 12.6915805 13.1017156 -0.4101351 -2.1342588 -0.2732020 0.0124200
## 2911 2.0203592 5.0193122 -0.7847599 -0.4080033 -0.0672422 11.9276806 12.3765710 -0.4488903 -2.3318266 -0.2608144 0.0113200
## 2196 2.1608118 -0.7419335 0.0941721 10.2909242 -0.7759464 11.9183906 12.1467354 -0.2283448 -1.2039219 -0.2538657 0.0107400
## 2294 -0.2842491 10.5273162 0.7371910 -0.7043724 -0.0672422 12.1441972 12.3653101 -0.2211128 -1.1654485 -0.2441374 0.0099300
## 754 1.2429248 -0.7419335 -0.9070946 0.1254613 -0.7759464 11.0429218 11.9036271 -0.8607053 -4.4607614 -0.2401771 0.0095500
## 1570 0.3744930 2.6451725 -0.0200408 1.0738426 -1.4846505 12.3392915 11.7873280 0.5519635 2.8590746 0.2177606 0.0078800
##
##
## FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals do pred_rows="all"]
## -------------------------------------------------------------------------------------------------------------------------------
## GrLivArea WoodDeckSF LotArea OpenPorchSF OverallQual y pred sf pi:lwr pi:upr width
## 1902 -2.3059742 -0.7419335 -0.6532881 -0.7043724 -3.6107628 10.5789798 10.7756917 0.1943320 10.3946503 11.1567331 0.7620829
## 1554 -1.5166706 -0.7419335 0.5629528 -0.7043724 -3.6107628 9.4803675 10.9144732 0.1943528 10.5333910 11.2955554 0.7621644
## 766 -1.1783976 -0.7419335 -0.0162337 -0.7043724 -3.6107628 11.0186291 10.9300265 0.1943562 10.5489376 11.3111155 0.7621779
## ...
## 789 0.3744930 -0.7419335 -0.0371727 -0.7043724 -0.0672422 11.7558716 12.0121364 0.1939446 11.6318545 12.3924183 0.7605637
## 1729 0.0718277 0.0494464 0.0113043 -0.1412710 -0.0672422 12.0782393 12.0128297 0.1938880 11.6326589 12.3930005 0.7603416
## 162 -0.5453721 2.5343793 -0.2644565 -0.7043724 -0.0672422 11.7905572 12.0134903 0.1941579 11.6327901 12.3941904 0.7614003
## ...
## 2451 3.9570215 6.1430716 0.9002617 0.5700150 2.0588701 13.2785121 13.2575762 0.1953979 12.8744448 13.6407075 0.7662627
## 957 1.0609300 -0.7419335 26.0274893 -0.7043724 0.6414619 12.8346813 13.3120807 0.2171743 12.8862506 13.7379109 0.8516603
## 2667 4.1706676 -0.7419335 1.6246255 3.1484269 2.7675742 13.0710701 13.2740079 0.1946960 12.8922528 13.6557630 0.7635102
##
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: ScatterPlot Matrix
## ----------------------------------
Using the non-scaled data, the equation is: \(Y = 10.4937 + 0.000226X_{1} + 0.000279X_{2} + 0.0000051X_{3} + 0.000143X_{4} + 0.1811X_{5}\). Interpreting the coefficients directly is not recommended since they are on the log scale, so we won’t interpret them here, but all variables are significant: we reject the null hypothesis for each t-statistic and for the omnibus test.
Looking at the residual plots:
Residuals vs Leverage displays a few points outside Cook's distance, an indication of high-leverage, influential observations. Also, at least between 0.00 and 0.05, the spread of standardized residuals does appear to decrease slightly.
The Normal Q-Q plot suggests the residuals are not normally distributed.
Scale-Location shows the spread of the residuals across the range of fitted values. We see a reasonably straight line with randomly scattered points, so I lean toward this satisfying homoscedasticity.
Residuals vs Fitted shows a fairly straight line and random points, which suggests there are no non-linear relationships.
plot(model5_raw)
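The visual read on constant variance can be backed up with a formal check. A base-R Breusch-Pagan style sketch (mtcars stands in for the housing data so the snippet runs on its own; in this document `fit` would be `model5_raw`):

```r
# Breusch-Pagan style test for heteroscedasticity, base R only.
# Regress squared residuals on the predictors; n * R^2 of that auxiliary
# regression is approximately chi-squared with df = number of predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)            # stand-in for model5_raw
aux <- lm(resid(fit)^2 ~ wt + hp, data = mtcars)   # auxiliary regression
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
# Small p-value -> evidence against constant residual variance
c(statistic = bp_stat, p.value = p_value)
```

A packaged equivalent is `lmtest::bptest(fit)`, if that package is available.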
Now we run DFFITS again on this data and remove outliers.
t2<-ols_plot_dffits(model5_raw)
n <- 2930
p <- 3  # note: the fitted model has 5 predictors, but this threshold was computed with p = 3
threshold <- 2*sqrt((p+1)/(n-p-1))
threshold
## [1] 0.07394739
dffitsData <- t2$data
model5_data$Dffits_indicator <-dffitsData$dbetas
updatedData5 <- model5_data[which(model5_data$Dffits_indicator <= threshold & model5_data$Dffits_indicator >= -threshold),]
str(updatedData5)
## 'data.frame': 2734 obs. of 8 variables:
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
## $ GrLivArea : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ WoodDeckSF : int 210 140 393 0 212 360 0 0 237 140 ...
## $ LotArea : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ OpenPorchSF : int 62 0 36 0 34 36 0 82 152 60 ...
## $ OverallQual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ LogSalePrice : num 12.3 11.6 12.1 12.4 12.2 ...
## $ Dffits_indicator: num 0.02761 -0.02369 -0.00338 0.02404 0.04524 ...
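Pulling the DFFITS values out of olsrr's plot object works, but base R's `dffits()` returns the same numbers directly. A sketch on a stand-in model (mtcars), since it only needs a fitted `lm`:

```r
# Same DFFITS screen using base R, no olsrr required.
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in for the housing model
n <- nrow(mtcars)
p <- length(coef(fit)) - 1                # number of predictors
thresh <- 2 * sqrt((p + 1) / (n - p - 1))
keep <- abs(dffits(fit)) <= thresh        # TRUE for non-influential rows
cleaned <- mtcars[keep, ]
nrow(cleaned)                             # rows surviving the DFFITS cut
```

Deriving `p` from `coef(fit)` keeps the threshold consistent with the model actually fitted, avoiding the hard-coded mismatch above.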
updatedData5_reg5<-lm(LogSalePrice~GrLivArea+WoodDeckSF+LotArea+OpenPorchSF+OverallQual,data=updatedData5)
summary(updatedData5_reg5)
##
## Call:
## lm(formula = LogSalePrice ~ GrLivArea + WoodDeckSF + LotArea +
## OpenPorchSF + OverallQual, data = updatedData5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48807 -0.09094 0.00701 0.10065 0.48105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.4970767479 0.0133353759 787.160 < 0.0000000000000002
## GrLivArea 0.0002409056 0.0000074306 32.421 < 0.0000000000000002
## WoodDeckSF 0.0002834005 0.0000237375 11.939 < 0.0000000000000002
## LotArea 0.0000109425 0.0000006258 17.485 < 0.0000000000000002
## OpenPorchSF 0.0002985507 0.0000495421 6.026 0.0000000019
## OverallQual 0.1687696610 0.0025012476 67.474 < 0.0000000000000002
##
## Residual standard error: 0.1396 on 2728 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8592
## F-statistic: 3337 on 5 and 2728 DF, p-value: < 0.00000000000000022
The updated \(R^{2}\) is 0.8595 (adjusted 0.8592), roughly 86%.
In general, transformation and data removal appear to help overall model performance. While a benefit on the surface, the tradeoff is interpretability: as we log-transform the data and remove data points, the coefficients become harder to read directly. I was surprised, though, that scaling did not change the results much.
I see the hypothesis tests as a good guide, but as I dug deeper, they seem to introduce some subjectivity into how model variables are picked. For example, while a test statistic may say a variable is significant, the scatterplot may tempt me not to pick it, or vice versa. So far, every variable I picked came back statistically significant, which raises a bit of "too good to be true" doubt.
As for next steps, an automated variable selection method would be the logical way to take over some of the manual work. My variables were chosen purely from a high-level look, so having an automated way to compare the \(R^{2}\) difference of every predictor combination would help.
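As a starting point for that next step, base R's `step()` performs stepwise AIC selection (olsrr's `ols_step_all_possible` would enumerate every subset, but is much heavier). Sketched on a stand-in dataset:

```r
# Stepwise AIC selection; in this document the full model would be
# lm(LogSalePrice ~ ., data = ...) on the candidate predictors.
full <- lm(mpg ~ ., data = mtcars)                 # all candidate predictors
sel  <- step(full, direction = "both", trace = 0)  # add/drop terms by AIC
formula(sel)                                       # the retained predictor set
```

AIC-based stepwise search is a heuristic, not an exhaustive comparison of every \(R^{2}\), but it automates exactly the kind of add/drop judgment described above.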
#### Initial histogram of continuous variables.
Some variables look roughly normally distributed while others are skewed. While independent variables are not assumed to be normal in a linear regression model, the skewness can tell us a bit about potential outliers and influential observations.
options(scipen = 999)
for(i in 1:length(data_cont)){
print(
ggplot(data_cont, aes(x=data_cont[,i])) + geom_histogram(fill = "purple") + theme_classic() +
labs(x=colnames(data_cont[i]),
y="Frequency",title = paste0("Frequency of"," ",colnames(data_cont[i]))))
}
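A numeric companion to the histograms is a per-variable skewness summary. A sketch using a hand-rolled skewness function to avoid extra packages (in this document it would be applied to `data_cont`; mtcars stands in so the snippet runs on its own):

```r
# Sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]                     # data_cont has NAs (e.g. LotFrontage)
  mean((x - mean(x))^3) / sd(x)^3
}
# In the document: sort(sapply(data_cont, skewness), decreasing = TRUE)
sort(sapply(mtcars, skewness), decreasing = TRUE)
```

Large positive values would flag the same heavy-tailed variables the histograms show visually.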
### Scatter plot of continuous variables against SalePrice.
The scatter plots let us visually assess correlation with sale price. Any variable that does not look correlated may be a candidate for removal.
***
for(i in 1:length(data_cont)){
print(
ggplot(data_cont, aes(x=data_cont[,i], y=SalePrice)) + geom_point(col = "purple") + geom_smooth(method='lm', fill = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x=colnames(data_cont[i]),
y="Sale Price",title = paste0("Scatterplot of"," ",colnames(data_cont[i]))))
}
### Correlation matrix
***
The correlation matrix confirms our scatter plots and shows which variables have the highest correlation coefficients. A high correlation with SalePrice may be an indication of that variable's importance.
corr <-cor(data_cont)
ggcorrplot(corr,
type ="lower",
lab = TRUE,
lab_size=3,
method = "circle",
colors = c("red", "white", "purple"),
title = "Correlogram of Housing prices",
ggtheme = theme_classic)
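To read the strongest SalePrice correlations off numerically rather than from the plot, the SalePrice column of the correlation matrix can be sorted directly (`use = "pairwise.complete.obs"` handles the NAs in columns like LotFrontage). In this document the call would run on `data_cont`; a tiny stand-in frame keeps the sketch self-contained:

```r
# Stand-in frame; in the document: cor(data_cont, use = "pairwise.complete.obs")
df <- data.frame(GrLivArea = c(1656, 896, 1329, 2110),
                 LotArea   = c(31770, 11622, 14267, 11160),
                 SalePrice = c(215000, 105000, 172000, 244000))
corr_sp <- cor(df, use = "pairwise.complete.obs")[, "SalePrice"]
sort(corr_sp, decreasing = TRUE)  # SalePrice's self-correlation (1.0) ranks first
```

Sorting this vector gives the same ranking the correlogram shows, in a form that is easy to paste into a table.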
### Q-Q plot
The Q-Q plots reinforce the histogram view and show us which variables are approximately normal.
for(i in 1:length(data_cont)){
if (class(data_cont[,i]) == "integer"){
print(
ggplot(data_cont, aes(sample=data_cont[,i])) +
stat_qq(col = "purple") +
stat_qq_line(col = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="Theoretical",
y="Sample", title = paste0("Q-Q to test for Normality of ", colnames(data_cont[i]))))
} else {
next
}
}
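The Q-Q reading can also be checked numerically with Shapiro-Wilk (`shapiro.test` drops NAs and accepts up to 5,000 observations, so the 2,930-row data fits). In the document this would be `sapply(data_cont, ...)`; a stand-in frame keeps the sketch self-contained:

```r
# Shapiro-Wilk p-values per column: a low p-value rejects normality.
set.seed(42)
df <- data.frame(normal_ish = rnorm(300),  # roughly normal column
                 skewed     = rexp(300))   # clearly non-normal column
p_vals <- sapply(df, function(x) shapiro.test(x)$p.value)
round(p_vals, 4)
```

Variables whose Q-Q points hug the reference line should come back with the larger p-values, matching the visual ranking.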